Life and language are discrete combinatorial systems (DCSs) in which the basic building blocks
are finite sets of elementary units: nucleotides or codons in a DNA sequence and letters or words
in a language. Different combinations of these finite units give rise to potentially infinite numbers
of genes or sentences. This type of DCS can be represented as an Alphabetic Bipartite Network
(a-BiN) with two kinds of nodes: one type represents the elementary units, while the other
type represents their combinations. There is an edge between a node corresponding to an elementary
unit u and a node corresponding to a particular combination v if u is present in v. Naturally, the
partition consisting of the nodes representing elementary units is fixed, while the other partition
is allowed to grow unboundedly. Here, we extend recent analytical findings for a-BiNs derived
in [Peruani et al., Europhys. Lett. 79, 28001 (2007)] and empirically investigate two real-world
systems: the codon-gene network and the phoneme-language network. The evolution equations
for a-BiNs under different growth rules are derived, and the corresponding degree distributions
computed. It is shown that asymptotically the degree distribution of a-BiNs can be described as a
family of beta distributions. The one-mode projections of the theoretical as well as the real-world
a-BiNs are also studied. We propose a comparison of the real-world degree distributions and our
theoretical predictions as a means for inferring the mechanisms underlying the growth of real-world
systems.



I. INTRODUCTION 

Two of the greatest wonders of evolution on earth, 
life and language, are discrete combinatorial systems 
(DCSs) Q. The basic building blocks of DCSs are finite 
sets of elementary units, such as the letters in language 
and nucleotides (or codons) in DNA. Different combina- 
tions of these finite elementary units give rise to a po- 
tentially infinite number of words or genes. Here, we 
propose a special class of complex networks as a model 
of DCSs. We shall refer to them as Alphabetic Bipartite 
Networks (a-BiNs) in order to signify the fact that the 
set of basic units, in both human and genetic languages, 
can be considered as an Alphabet. 

The a-BiNs are a subclass of networks where there are 
two different sets (partitions) of nodes: the bipartite net- 
works. An edge, in a bipartite network, links nodes that 
appear in two different partitions, but never those in the 
same set. In most of the bipartite networks studied in
the past both the partitions grow with time. Typical
examples of this type of networks include collaboration 
networks such as the movie-actor d, [E 3, H, B, article- 
author 0, H, Q, and board-director [101 [ill networks. 
In the article-author network, for instance, the articles 
and authors are the elements of the two partitions also 
known as the ties and actors respectively. An edge be- 
tween an author a and an article m indicates that a has 
co-authored m. The authors a and a' are collaborators if 
both have coauthored the same article, i.e., if both are 
connected to the same node m. The concept of collabora- 
tion can be extended to represent, through bipartite net- 
works, several diverse phenomena such as the city-people 
network [l^ . in which an edge between a person and a 
city indicates that the person has visited that particular 
city, the word-sentence [13, [13], bank-company [ll| or 
donor-acceptor networks that account for injection and 
merging of magnetic field lines [l^ . 

Several models have been proposed to synthesize the 
structure of these bipartite networks, i.e., when both the 
partitions grow unboundedly over time ^2; iJi [3j [S [l3]- 
It has been found that for such growth models, when 
each incoming tie node preferentially attaches itself to
the actor nodes, the emergent degree distribution of the
actor nodes follows a power-law ^] . This result is reminiscent
of unipartite networks where preferential attachment
results in power-law degree distributions.

On the other hand, bipartite networks where one of the
partitions remains fixed over time have received
comparatively little attention. Since the set of basic units
in DCSs is always finite and constant, a-BiNs have one
of their partitions fixed, the one that represents the basic
units (e.g., letters, codons). In contrast, the other partition,
which represents the unique discrete combinations
of basic units (e.g., words, genes), can grow unboundedly
over time. Notice that the order in which the basic units 
are strung to form the discrete combination is an impor- 
tant and indispensable aspect of the system, which can 
be modeled within the framework of a-BiNs by allow- 
ing ordering of the edges. Nevertheless, the scope of the 
present work is limited to the analysis of unordered com- 
binations. Here we assume a word to be a bag of letters 
and a gene a multiset of codons. Fig. 1 illustrates the
concepts through the example of genes and codons. 

A first systematic and analytical study of a-BiNs has 
been presented in [l^, where a growth model for such 
networks based on preferential attachment coupled with 
a tunable randomness component has been proposed and 
analyzed [13]. For sequential attachment, i.e., when the 
edges are incorporated one by one, the exact expres- 
sion for the emergent degree distribution has been de- 
rived. Nevertheless, for parallel attachment, i.e., when 
multiple edges are incorporated in one time step, only 
an approximate expression has been proposed. It has 
been shown that for both the cases, the degree distribu- 
tion approaches a beta-distribution asymptotically with 
time. Depending on the value of the randomness parameter,
four distinct types of distributions can be observed;
these, in increasing order of preferentiality, are:
(a) normal distribution, (b) skewed normal distribution
with a single mode, (c) exponential distribution, and (d)
U-shaped distribution.

In this article, we briefly review these findings and 
extend the analytical framework. We derive the exact 
growth model for parallel attachment and study the de- 
gree distribution of the one-mode projection of the net- 
work onto the alphabet nodes. These analytical find- 
ings are further applied to study two well-known DCSs 
from the domain of biology and language. We observe 
that in the codon-gene network (codons are basic units 
or alphabet, genes are the discrete combinations), the 
higher the complexity of an organism, the higher the 
value of the randomness parameter. Similarly, the theory 
can also satisfactorily explain the distribution of conso- 
nants over the languages of the world studied through 
the phoneme-language network (phonemes are the basic 
units and the sound systems of languages are the discrete
combinations). Nevertheless, the study also reveals
certain limitations of the current growth models. For instance,
we observe that the topological characteristics of
the network of co-occurrence of phonemes, which is the
one-mode projection of the aforementioned network, are
different from the theoretical predictions. Thus, although
the simple preferential-attachment-based growth model
succeeds in explaining the degree distribution of the basic
units of the a-BiN, it fails to describe the one-mode
projection, which indicates that the real dynamics of the
system is much more complex.

FIG. 1: DNA modeled as a bipartite network (a-BiN). The
set U consists of 64 codons, whereas the set V of genes is
virtually infinite. Multiple occurrences of a codon in a gene
are represented here by multi-edges. For instance, the
codons 'ACG' and 'AAU' have respectively 2 and 3 edges
connecting to the node gene_a. Alternatively, this could have
been represented by single edges with weights 2 and 3, while
the weight of the other edges would be equal to 1.

The rest of the article is organized as follows: Sec. II
formally defines a-BiNs and introduces two growth models
and their corresponding theoretical analysis. The
two real networks, codon-gene and phoneme-language,
their topology and comparison with the theoretical
model are described in Secs. III A and III B, respectively.
In Sec. IV we summarize the obtained results, discuss
the broader consequences of the present work and propose
some applications and alternative perspectives on
the same.



II. THEORETICAL FRAMEWORK FOR a-BINS 

A. Formal definition and modeling 

A bipartite graph G is a 3-tuple ⟨U, V, E⟩, where U and
V are mutually exclusive finite sets of nodes (also known
as the two partitions) and E ⊆ U × V is the set of edges
that run between these partitions. We can also define
E as a multiset whose elements are drawn from U × V.
Clearly, this construction allows multiple edges between
a pair of nodes, and the number of times the nodes u ∈ U
and v ∈ V are connected can be taken as the weight
of the edge (u, v). Note that although we are defining E
to be a set of ordered tuples, the ordering is an implicit
outcome of the fact that edges only run between nodes in
U and V. In essence, we do not mean any directedness
of the edges.

a-BiNs are a special type of bipartite networks where 
one of the partitions represents a set of basic units while 
the other partition represents their combinations. The 
set of basic units is essentially finite and fixed over time. 
Let us denote the unique basic units by the nodes in U. 
Let each unique discrete combination of the basic units 
be denoted as a node in V. There exists an edge between
a basic unit u ∈ U and a discrete combination v ∈ V iff u
is a part of v. If u occurs in v w times, then there are w
edges between u and v, or alternatively, the weight of the
edge (u, v) is w. Fig. 1 illustrates these concepts through
the example of genes and codons.
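
The definition above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the gene names and codon lists are hypothetical toy data (not real sequences), and the helper names `build_abin` and `unit_degrees` are ours. It stores E as a multiset, so the weight of an edge (u, v) is the number of times u occurs in v.

```python
from collections import Counter


def build_abin(combinations):
    """Return the edge multiset of an a-BiN as a Counter {(u, v): weight}.

    `combinations` maps each node v in V to the sequence of basic units it
    contains; a unit that occurs w times in v yields an edge of weight w.
    """
    edges = Counter()
    for v, units in combinations.items():
        for u in units:
            edges[(u, v)] += 1
    return edges


def unit_degrees(edges):
    """Degree of each basic unit u: the sum of the weights of its edges."""
    degree = Counter()
    for (u, _v), weight in edges.items():
        degree[u] += weight
    return degree


# Hypothetical toy genes, written as unordered bags (multisets) of codons.
genes = {
    "gene_a": ["ACG", "AAU", "ACG", "AAU", "AAU", "GGC"],
    "gene_b": ["AAU", "GGC"],
}
edges = build_abin(genes)
degrees = unit_degrees(edges)
```

Because the order of the units is discarded, two genes with the same codon counts are indistinguishable in this representation, which matches the unordered setting considered in this work.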

Notice that the above model overlooks the order in 
which the basic units are strung into a particular dis- 
crete combination. The order can be taken into account 
by labeling the basic units in order of appearance in each 
element of V. However, in this work, we consider only un- 
ordered versions of DCSs. As we shall see subsequently, 
several real world DCSs, such as the phoneme-language 
network, are, in fact, unordered sets. 



B. Growth model for sequential attachment 

In this subsection, we review the results derived in [Tsj,
which apply to sequential as well as parallel attachment.
While the results for sequential attachment are exact,
for parallel attachment they represent an approximation.
In the next subsection these results are
extended and the exact derivation for parallel attachment
is presented.

The growth of a-BiNs is described in terms of a simple
model based on preferential attachment coupled with a
tunable randomness parameter. Suppose that the partition
U has N nodes labeled u_1 to u_N. At each time
step, a new node is introduced in the set V, which connects
to μ nodes in U based on a predefined attachment
rule. Let v_i be the node added to V during the i-th time
step. The theoretical analysis assumes that μ is a constant
greater than 0. This constraint will be relaxed during
the synthesis of the empirical networks. However, note
that if the degrees of the nodes in V are sampled from
a Poisson-like distribution with mean μ, the theoretical
analysis still holds asymptotically.

Let A(k_i^t) be the probability of attaching a new edge
to a node u_i, where k_i^t refers to the degree of the node u_i
at time t. A(k_i^t) defines the attachment kernel, which takes
the form:

$$A(k_i^t) = \frac{\gamma k_i^t + 1}{\sum_{j=1}^{N} \left(\gamma k_j^t + 1\right)} \qquad (1)$$

where the sum in the denominator runs over all the nodes
in U, and γ is the tunable parameter which controls
the relative weight of preferential to random attachment.
Thus, the higher the value of γ, the lower the randomness
in the system. Since in a bipartite network the sums of the
degrees of the nodes in the two partitions are equal, the
denominator in the above expression is equal to μγt + N.
Note that the numerator of the attachment kernel could
be rewritten as k_i^t + α, where α = 1/γ is a positive constant
usually referred to as the initial attractiveness [19].
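
As a minimal sketch of the kernel just described, the function below computes the attachment probability of every node in U from the current degrees, using a numerator of the form γk + 1 and normalizing over all nodes. The function name and the toy degree list are our own assumptions for illustration.

```python
def attachment_probabilities(degrees, gamma):
    """Probability that a new edge attaches to each node of U.

    Each node with degree k gets weight gamma * k + 1; the weights are
    normalized by their sum over all nodes in U.
    """
    weights = [gamma * k + 1.0 for k in degrees]
    total = sum(weights)
    return [w / total for w in weights]


# Toy state (hypothetical): N = 4 basic units carrying mu * t = 4 edges in total.
toy_degrees = [3, 1, 0, 0]
probs = attachment_probabilities(toy_degrees, gamma=0.5)
normalizer = sum(0.5 * k + 1.0 for k in toy_degrees)
```

In this toy state the normalizer is 0.5 · 4 + 4 = 6, i.e. γ times the total number of edges plus N, and setting γ = 0 yields purely random (uniform) attachment.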

Physically this means that when a new discrete combination,
say a gene, enters the system, it is always assumed
to have μ basic units, e.g., a chain of μ codons.
The pattern of the codons constituting the newly entered
gene depends on the prevalence of the codons in the
pre-existing genes as well as a randomness factor 1/γ. At
this point it is worthwhile to distinguish between a few
basic sub-cases of the growth model. When μ = 1, addition
of a node in V is equivalent to addition of one edge
in the network and thus the edges attach to the nodes
in U in a sequential manner. However, for μ > 1 addition
of an edge is no longer a sequential process; rather
μ edges are added simultaneously. We refer to the former
process as sequential attachment and the latter as
parallel attachment. Depending on the underlying DCS,
the parallel attachment process can be further classified
into two sub-cases. If it is required that the μ nodes chosen
are all distinct, then we call this parallel attachment
without replacement. On the other hand, if v_i is allowed
to attach to the same node more than once, we refer to
the process as parallel attachment with replacement [4ll |.
Thus, parallel attachment without replacement leads to
a-BiNs without multi-edges or weighted edges, while parallel
attachment with replacement results in a-BiNs with
multi-edges. The two cases coincide for sequential
attachment. To motivate the reader further, we
provide some examples of natural DCSs from each of the
aforementioned classes.

• Sequential attachment: Since in the sequential attachment
model every node in V has only one edge,
it is not a discrete combination at all. Rather, each
incoming v_i is a reinstantiation of some basic unit
u_j. As an example, consider a system where U is the set
of languages and V is the set of speakers, and an
edge between u ∈ U and v ∈ V implies that u
is the mother tongue of v. Although not DCSs,
systems of this "class and its instance" type are
plentiful in nature and can be aptly modeled using
sequential attachment.

• Parallel attachment with replacement: Any DCS 
modeled as a sequence of the basic units can be 
thought to follow the "with replacement" model.
For instance, a gene can have many repetitions of
the same codon and similarly, there may be multi- 
ple occurrences of the same word in a sentence. 

• Parallel attachment without replacement: A DCS 
that is a set of the basic units can be conceived as an 
outcome of the "without replacement" model. For 
instance, the consonants and vowels (partition U) 
that form the repertoire of basic sounds (phonemes) 
of a language (partition V), proteins (U) forming 
protein complexes (V), etc. 

In this work, we focus on the topological properties 
of a-BiNs that are synthesized using the sequential and 
parallel attachment with replacement. Nevertheless, in 
Sec. III B we also present some empirical results for
the parallel attachment without replacement model in 
the context of the phoneme-language network. 
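
The three attachment schemes can be sketched as one stochastic simulation. This is a hedged illustration, not the authors' simulation code: the function name `grow_abin` and its parameters are ours, all μ edges of a time step are drawn from the degrees at the start of that step, and the "without replacement" branch is implemented, as an assumption, by drawing the μ nodes one after another and renormalizing the kernel over the nodes not yet chosen.

```python
import random


def grow_abin(n_units, mu, gamma, steps, replacement=True, seed=0):
    """Simulate a-BiN growth and return the final degrees of the nodes in U.

    At every step a new node of V attaches mu edges to nodes of U, each
    chosen with weight gamma * k + 1. With mu = 1 this is sequential
    attachment. The without-replacement scheme (distinct nodes, mu <= n_units)
    is an assumed implementation, see the accompanying text.
    """
    rng = random.Random(seed)
    degrees = [0] * n_units
    for _ in range(steps):
        weights = [gamma * k + 1.0 for k in degrees]
        if replacement:
            chosen = rng.choices(range(n_units), weights=weights, k=mu)
        else:
            pool = list(range(n_units))
            pool_weights = list(weights)
            chosen = []
            for _ in range(mu):
                idx = rng.choices(range(len(pool)), weights=pool_weights, k=1)[0]
                chosen.append(pool.pop(idx))
                pool_weights.pop(idx)
        for u in chosen:
            degrees[u] += 1
    return degrees


with_repl = grow_abin(n_units=50, mu=5, gamma=1.0, steps=200)
single_step = grow_abin(n_units=10, mu=4, gamma=2.0, steps=1, replacement=False)
```

Averaging histograms of the returned degrees over many seeds gives an empirical estimate of the degree distribution of U that can be compared with the analytical expressions.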

Any a-BiN has two characteristic degree distributions
corresponding to its two partitions U and V. Here we
assume that each node in V has degree μ and concentrate
on the degree distribution of the nodes in U. Let p_{k,t} be
the probability that a randomly chosen node from the
partition U has degree k after t time steps. We assume
that initially all the nodes in U have degree 0 and there
are no nodes in V. Therefore,

$$p_{k,0} = \delta_{k,0} \qquad (2)$$

Here, δ represents the Kronecker symbol. It is interesting
to note that unlike the case of standard preferential
attachment based growth models for unipartite (e.g., the
BA model [13) and bipartite networks (e.g., [1]), the degree
distribution of the partition U in a-BiNs cannot be
solved using the stationary assumption that in the limit
t → ∞, p_{k,t+1} = p_{k,t}. This is because the average degree
of the nodes in U, which is μt/N, diverges with t,
and consequently, the system does not have a stationary
state.

In [Tsj it has been shown that p_{k,t} can be approximated
for μ ≪ N and small values of γ by integrating:

$$p_{k,t+1} = \left(1 - A_p(k,t)\right) p_{k,t} + A_p(k-1,t)\, p_{k-1,t} \qquad (3)$$

where A_p(k,t) is defined as

$$A_p(k,t) = \begin{cases} \dfrac{(\gamma k + 1)\,\mu}{\gamma \mu t + N} & \text{for } 0 \le k \le \mu t \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

for t > 0, while for t = 0, A_p(k,t) = (μ/N) δ_{k,0}. The
numerator contains a μ because at each time step there
are μ edges that are being incorporated into the network
rather than a single edge. The solution of Eq. (3) with
the attachment kernel given by Eq. (4) reads:



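$$p_{k,t} = \binom{t}{k} \frac{\prod_{i=0}^{k-1} \left(\gamma i + 1\right) \prod_{j=0}^{t-1-k} \left(\frac{N}{\mu} - 1 + \gamma j\right)}{\prod_{m=0}^{t-1} \left(\gamma m + \frac{N}{\mu}\right)} \qquad (5)$$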



As already mentioned in [l^, Eq. (3) cannot describe
the stochastic parallel attachment exactly because it explicitly
assumes that in one time step a node of degree k
can only get converted to a node of degree k + 1. Clearly,
the incorporation of μ edges in parallel allows the possibility
for a node of degree k to get converted to a node
of degree k + μ. So, for μ > 1, Eq. (5) is just an approximation
of the real process, valid for μ ≪ N and small values
of γ. However, for μ = 1, i.e., for sequential attachment,
Eq. (5) is the exact solution of the process.

FIG. 2: The four possible degree distributions depending on
γ for sequential attachment (and the approximate expression for
parallel attachment). Symbols represent averages over 5000, in
(a)-(c), and 50000, in (d), stochastic simulations. The dashed
curve is the theory given by Eq. (5). From (a) to (c), t =
1000, N = 1000 and μ = 20. (a) At γ = 0, p(k,t) becomes a
binomial distribution. (b) γ = 0.5: the distribution exhibits
a maximum which shifts with time for 0 < γ < 1. (c) γ = 1:
p(k,t) no longer exhibits a shifting maximum and the
distribution is a monotonically decreasing function of k for
1 < γ < (N/μ) − 1. (d) γ = 2500, t = 100, N = 1000 and
μ = [illegible in source]: p(k,t) becomes a U-shaped curve for γ > (N/μ) − 1.
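
A quick numerical cross-check of the one-edge-per-step recursion against a closed-form product solution can be sketched as follows. The closed form used here is our assumption about the structure of the solution: a binomial coefficient times products of γi + 1 and N/μ − 1 + γj, divided by a product of γm + N/μ. The function names are hypothetical, and the float arithmetic is only suitable for moderate t.

```python
from math import comb


def iterate_recursion(t_max, gamma, n_units, mu):
    """Iterate p_{k,t+1} = (1 - a) p_{k,t} + a' p_{k-1,t} from p_{k,0} = delta_{k,0},
    with a = mu * (gamma * k + 1) / (gamma * mu * t + N)."""
    eta = n_units / mu
    p = [1.0]
    for t in range(t_max):
        new = [0.0] * (len(p) + 1)
        for k, pk in enumerate(p):
            a = (gamma * k + 1.0) / (gamma * t + eta)
            new[k] += (1.0 - a) * pk      # node keeps degree k
            new[k + 1] += a * pk          # node moves to degree k + 1
        p = new
    return p


def closed_form(k, t, gamma, n_units, mu):
    """Assumed product solution of the recursion (see lead-in)."""
    if k < 0 or k > t:
        return 0.0
    eta = n_units / mu
    factors = [gamma * i + 1.0 for i in range(k)]
    factors += [eta - 1.0 + gamma * j for j in range(t - k)]
    value = float(comb(t, k))
    for m, factor in enumerate(factors):
        value *= factor / (gamma * m + eta)
    return value


p_iter = iterate_recursion(20, 0.5, 100, 5)
p_closed = [closed_form(k, 20, 0.5, 100, 5) for k in range(21)]
```

Pairing each numerator factor with one denominator factor keeps the intermediate values bounded, which is why the product is accumulated as a sequence of ratios rather than as two large products.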

Interestingly, for γ > 0, Eq. (5) approaches, asymptotically
with time, a beta distribution as follows:

$$p_{k,t} \simeq C \left(\frac{k}{t}\right)^{\frac{1}{\gamma} - 1} \left(1 - \frac{k}{t}\right)^{\frac{N/\mu - 1}{\gamma} - 1} \qquad (6)$$

Here, C is the normalization constant. By making use
of the properties of beta distributions, we learn that depending
on the value of γ, p_{k,t} can take one of the following
distinctive functional forms.

a) γ = 0, a binomial distribution whose mode shifts
with time,

b) 0 < γ < 1, a skewed (normal) distribution which
exhibits a mode that shifts with time,

c) 1 < γ < (N/μ) − 1, a monotonically decreasing (near
exponential) distribution with the mode frozen at k = 0,
and

d) γ > (N/μ) − 1, a U-shaped distribution with peaks
at k = 0 and k = t.
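
The four regimes can be read off from the signs of the two exponents of the asymptotic beta form. The sketch below assumes that form is proportional to (k/t)^a (1 − k/t)^b with a = 1/γ − 1 and b = (N/μ − 1)/γ − 1; it also assumes N/μ > 2 and assigns the boundary values γ = 1 and γ = N/μ − 1 to the monotonically decreasing regime, which is our choice. The function names and regime labels are illustrative.

```python
def beta_exponents(gamma, n_units, mu):
    """Exponents (a, b) of (k/t)^a (1 - k/t)^b for gamma > 0 (assumed form)."""
    a = 1.0 / gamma - 1.0
    b = (n_units / mu - 1.0) / gamma - 1.0
    return a, b


def regime(gamma, n_units, mu):
    """Classify the asymptotic shape of the degree distribution of U."""
    if gamma == 0:
        return "binomial"
    a, b = beta_exponents(gamma, n_units, mu)
    if a > 0:
        return "unimodal"      # 0 < gamma < 1: mode shifts with time
    if b >= 0:
        return "decreasing"    # mode frozen at k = 0
    return "u-shaped"          # gamma > N/mu - 1: peaks at k = 0 and k = t


labels = [regime(g, 1000, 20) for g in (0, 0.5, 1.0, 2500)]
```

With N = 1000 and μ = 20 the U-shaped regime starts above γ = 49, so a value such as γ = 2500 lies deep inside it.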

Fig. 2 illustrates the four possible regimes of Eq. (5). In
the next subsection we present a generalization of Eq. (3)
for μ > 1, i.e., for parallel attachment, and solve it.

FIG. 3: Comparison for random attachment (γ = 0) between
the approximation given by Eq. (5) (dashed red curve), the
exact solution given by the integration of Eq. (11) (solid black
curve), and stochastic simulations. Symbols correspond to
averages over 500 simulations. In both figures N = 100.
(a) corresponds to μ = 20 while (b) to μ = 40. The inset in
(b) shows in log-log scale the deviation of the approximation
with respect to the exact solution and simulations.



C. Growth model for parallel attachment with 
replacement 

Recall that for parallel attachment, t refers to the event
of introducing a new node in V with μ edges. Therefore,
the correct expression for the evolution of p_{k,t} has the
form:

$$p_{k,t+1} = \left(1 - \sum_{i=1}^{\mu} A(k,i,t)\right) p_{k,t} + \sum_{i=1}^{\mu} A(k-i,i,t)\, p_{k-i,t} \qquad (7)$$

where A(k,i,t) represents the probability at time t of a
node of degree k receiving i new edges in the next time
step. The term Σ_{i=1}^{μ} A(k,i,t) p_{k,t} describes the fraction
of nodes of degree k at time t that change their degree
due to the attachment of 1, 2, ..., or μ edges. On the
other hand, nodes of degree k will be formed at time
t + 1 by the nodes of degree k − 1 at time t that receive
1 edge, nodes of degree k − 2 at time t that receive 2
edges, and so on. This process is described by the term
Σ_{i=1}^{μ} A(k − i,i,t) p_{k−i,t}.

Next we derive an expression for A(k,i,t). We start
with a simple case: γ = 0. Since in this case the probability
for an edge to attach to a node is independent of
its degree, if we add μ edges, the probability for a node to
receive a single edge is μ(1/N)(1 − 1/N)^{μ−1}, the probability
of receiving two edges is [μ(μ − 1)/2](1/N)^2(1 − 1/N)^{μ−2},
and for the general case we obtain the expression:

$$A(k,i,t) = \binom{\mu}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{\mu - i} \qquad (8)$$

To extend this result to γ > 0, we recall that if we add
a single edge, the probability for a node of degree k to
receive that edge is φ = (γk + 1)/(μγt + N), where we
have assumed that previous to this edge we had added μt
edges to the nodes in U. Clearly, 1 − φ is the probability
for the edge to attach to some other node. Taking this
into account, Eq. (8) is generalized for γ > 0 as

$$A(k,i,t) = \binom{\mu}{i} \left(\frac{\gamma k + 1}{\mu \gamma t + N}\right)^{i} \left(1 - \frac{\gamma k + 1}{\mu \gamma t + N}\right)^{\mu - i} \qquad (9)$$

FIG. 4: Comparison for strong preferential attachment (γ ≥ 1)
between the approximation given by Eq. (5) (dashed
red curve), the exact solution given by the integration of
Eq. (11) (solid black curve), and stochastic simulations (circles),
averaged over 500 runs, for parallel attachment.
In both figures N = 100 and μ = 40. (a) corresponds to
γ = 1 while (b) to γ = 16. Notice that in (b) the approximation
falls out of the range of the figure, while the exact
solution given by the integration of Eq. (11) describes the
simulation data quite well.

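Inserting expression (9) into Eq. (7), we obtain

$$\begin{aligned} p_{k,t+1} = {} & \left[1 - \sum_{i=1}^{\mu} \binom{\mu}{i} \left(\frac{\gamma k + 1}{\mu \gamma t + N}\right)^{i} \left(1 - \frac{\gamma k + 1}{\mu \gamma t + N}\right)^{\mu - i}\right] p_{k,t} \\ & + \sum_{i=1}^{\mu} \binom{\mu}{i} \left(\frac{\gamma (k-i) + 1}{\mu \gamma t + N}\right)^{i} \left(1 - \frac{\gamma (k-i) + 1}{\mu \gamma t + N}\right)^{\mu - i} p_{k-i,t} \end{aligned} \qquad (10)$$

The term in brackets in the first line of Eq. (10)
can be simplified by recalling that

$$1 - \sum_{i=1}^{\mu} \binom{\mu}{i} \left(\frac{\gamma k + 1}{\mu \gamma t + N}\right)^{i} \left(1 - \frac{\gamma k + 1}{\mu \gamma t + N}\right)^{\mu - i} = \left(1 - \frac{\gamma k + 1}{\mu \gamma t + N}\right)^{\mu}$$

Therefore, Eq. (10) can be rewritten by including i = 0
in the sum, whereby we obtain

$$p_{k,t+1} = \sum_{i=0}^{\mu} \binom{\mu}{i} \left(\frac{\gamma (k-i) + 1}{\mu \gamma t + N}\right)^{i} \left(1 - \frac{\gamma (k-i) + 1}{\mu \gamma t + N}\right)^{\mu - i} p_{k-i,t} \qquad (11)$$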


Note that Eq. (11) is a generalization of Eq. (2) in [T^,
which can be obtained from Eq. (11) by setting μ = 1.

In Fig. 3 we compare, for random attachment (γ = 0),
the approximation given by Eq. (5) (dashed red curve),
the exact solution given by the integration of Eq. (11)
(solid black curve), and stochastic simulations.

Notice that the approximation, as mentioned above,
deviates from the exact solution and simulations as μ increases.
Looking at Fig. 3 one may wrongly conclude
that the exact solution given by Eq. (11) is a very minor
improvement over the approximation given by Eq.
(5). However, for large values of γ (see Fig. 4), i.e., for
strong preferential attachment, Eq. (5) drastically fails
to describe the simulation data, while Eq. (11) accurately
explains the data. To summarize, Figs. 3 and 4
validate Eq. (11) and show that, except when μ ≪ N
and γ is small, the only way to describe the degree distribution
for parallel attachment with replacement is through
the integration of Eq. (11).
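
The numerical integration of the exact evolution equation for parallel attachment with replacement can be sketched as below. We assume the binomial transition probability described in this subsection: a node of degree k receives i of the μ new edges with probability C(μ, i) φ^i (1 − φ)^(μ − i), where φ = (γk + 1)/(μγt + N). The function names are ours, and the implementation pushes probability forward from each source degree, which is equivalent to summing over i for each target degree.

```python
from math import comb


def step_exact(p, t, gamma, n_units, mu):
    """Advance the degree distribution of U by one time step (mu new edges)."""
    denom = mu * gamma * t + n_units
    new = [0.0] * (len(p) + mu)
    for k, pk in enumerate(p):
        if pk == 0.0:
            continue
        phi = (gamma * k + 1.0) / denom
        for i in range(mu + 1):
            # probability that a node of degree k gains exactly i edges
            gain = comb(mu, i) * phi ** i * (1.0 - phi) ** (mu - i)
            new[k + i] += gain * pk
    return new


def integrate_exact(t_max, gamma, n_units, mu):
    """Integrate from the initial condition p_{k,0} = delta_{k,0}."""
    p = [1.0]
    for t in range(t_max):
        p = step_exact(p, t, gamma, n_units, mu)
    return p


p_exact = integrate_exact(t_max=10, gamma=1.0, n_units=100, mu=4)
mean_degree = sum(k * pk for k, pk in enumerate(p_exact))
```

Two sanity checks follow directly from the model: the distribution stays normalized, and the mean degree equals μt/N (here 4 · 10 / 100 = 0.4) for any γ.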



D. One-mode projection

In this section, we analyze the degree distribution of
the one-mode projection of a-BiNs onto the set U. Formally,
for an a-BiN ⟨U, V, E⟩, the one-mode projection
onto U is a graph G_U = ⟨U, E_U⟩, where u_i, u_j ∈ U are connected
(i.e., (u_i, u_j) ∈ E_U) if there exists a node v ∈ V
such that (u_i, v) ∈ E and (u_j, v) ∈ E. If there are w
such nodes in V which are connected to both u_i and u_j
in the a-BiN G, then there are w edges linking u_i and u_j
in the one-mode projection G_U. Alternatively, one can
conceive of a weighted version of G_U, where the weight
of the edge (u_i, u_j) is w. In the context of the codon-gene
network, the one-mode projection is a codon-codon
network, where two codons are connected by as many
edges as there are genes in which both of these codons
occur. The one-mode projection of an a-BiN provides
insight into the relationship between the basic units. For
instance, in linguistics the one-mode projection of the
word-sentence a-BiN reveals the co-occurrence of word
pairs, which in turn provides crucial information about
the syntactic and semantic properties of the words (see,
for example, [isl. l20l|).
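
The weighted one-mode projection defined above can be computed directly from the memberships of the nodes in V. The following is a minimal sketch; the function name `one_mode_projection` and the toy sentences are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def one_mode_projection(memberships):
    """Weighted projection onto U.

    `memberships` maps each node v in V to the basic units linked to it.
    The weight of (u_i, u_j) is the number of nodes v linked to both, so a
    unit repeated inside one v is counted once for that v.
    """
    weights = Counter()
    for units in memberships.values():
        for u_i, u_j in combinations(sorted(set(units)), 2):
            weights[(u_i, u_j)] += 1
    return weights


# Hypothetical word-sentence a-BiN with three sentences.
sentences = {"s1": ["a", "b", "c"], "s2": ["a", "b", "a"], "s3": ["c"]}
projection = one_mode_projection(sentences)
```

Sorting the units before forming pairs gives each undirected edge a single canonical key, so (a, b) and (b, a) are never counted separately.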

In [21] a general technique for computing the degree
distribution of the one-mode projection of a bipartite
network is described. The method has been derived by
making use of the concept of generating functions. As we
shall see shortly, this technique is only suitable for estimating
the weighted degree distribution of the one-mode
projection.

Here, we propose a novel technique to derive the thresholded
degree distribution of the one-mode projection for
any arbitrary threshold. We start by studying
the simple cases of the (non-thresholded) degree distribution
of the one-mode projection for sequential and parallel
attachment, and then focus on the new technique
to derive the degree distribution of the thresholded one-mode
projection for parallel attachment. Notice that
in order to distinguish the degree distributions of the
one-mode projection from their bipartite counterpart, we
shall use the symbol p_u(k,t) to refer to the probability
that a randomly chosen node from the one-mode projection
of an a-BiN with t nodes in V (i.e., after t time
steps) has degree k.

1. Sequential attachment 

Recall that in the sequential attachment based growth
model only one edge is added per time step and consequently,
every node in V has degree μ = 1. Therefore,
for any two nodes in U, say u_i and u_j, there is no node
v ∈ V which is connected to both u_i and u_j (this is
because the degree of v is 1). Thus, for a-BiNs that have
been grown using the sequential attachment model, the
one-mode projection is a degenerate graph with N nodes
and 0 edges. The degree distribution of this network is

$$p_u(k,t) = \delta_{k,0} \qquad (12)$$

2. Parallel attachment 

Recall that in the parallel attachment model, at each
time step the node which is added to V has μ edges.
Consider a node u ∈ U that has degree k in the a-BiN.
Therefore, u is connected to k nodes in V, each of which is
connected to μ − 1 other nodes in U. Defining the degree
of a node as the number of edges attached to it, in the
one-mode projection, u has a degree of q = k(μ − 1).
Consequently, the degree distribution of G_U, p_u(q,t), is
related to p_{k,t} in the following way:

$$p_u(q,t) = \begin{cases} p_{0,t} & \text{if } q = 0 \\ p_{k = q/(\mu - 1),\,t} & \text{if } \mu - 1 \text{ divides } q \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

Fig. 5 shows a comparison between stochastic simulations
(circles) and Eq. (13) (solid black curve). Notice
that this mapping simply implies that p_u(q = 0,t) = p_{0,t},
p_u(q = μ − 1,t) = p_{1,t}, p_u(q = 2(μ − 1),t) = p_{2,t}, ...,
p_u(q = j(μ − 1),t) = p_{j,t}. The same result can be derived
by using the generating function based technique
described in Eq. 70 of [21]. It is worth noticing that
q is the weighted degree of a node (i.e., the sum of the
weights of all the edges incident on a node), and therefore
does not give any information about the number of
distinct neighbors a node has.
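
The mapping from the bipartite degree k to the weighted projected degree q = k(μ − 1) amounts to spreading the bipartite distribution over multiples of μ − 1. A minimal sketch, with a hypothetical function name and a made-up input distribution:

```python
def projected_distribution(p_bipartite, mu):
    """Weighted degree distribution of the one-mode projection.

    p_bipartite[k] is the probability of bipartite degree k; the result has
    that probability at q = k * (mu - 1) and zero at every other q.
    """
    if mu < 2:
        raise ValueError("mu must be at least 2 for a non-trivial projection")
    stride = mu - 1
    p_u = [0.0] * (stride * (len(p_bipartite) - 1) + 1)
    for k, pk in enumerate(p_bipartite):
        p_u[k * stride] = pk
    return p_u


p_u_example = projected_distribution([0.5, 0.3, 0.2], mu=3)
```

For μ = 3 the example yields nonzero entries only at q = 0, 2 and 4, which makes the comb-like shape of the weighted distribution explicit.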

3. Thresholded degree distribution for parallel attachment

Weighted graphs, such as the one-mode projections of
a-BiNs, can be converted to a corresponding unweighted
version by the process of thresholding. A thresholded one-mode
projection graph (thresholded G_U) is constructed
by replacing every weighted edge in G_U by a single edge
iff the weight of that edge exceeds the threshold value
τ; otherwise, the edge is deleted. Thresholded degree
distributions are more popular in the complex network
literature than their weighted counterparts (see, for example,
dOjIlll). We shall denote the thresholded degree
distribution at threshold τ as p_u(q,t;τ).

FIG. 5: Comparison between stochastic simulations for the
one-mode projection (circles) and Eq. (13) (solid black curve).
In both figures N = 500 and γ = 1. Circles correspond
to averages over 1000 simulations. In (a) μ = 5 while in (b)
μ = 15.
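
Thresholding a weighted projection is a simple filter on the edge weights. The sketch below assumes the projection is stored as a mapping from node pairs to weights; the function names and the toy weights are ours.

```python
def threshold_projection(weights, tau):
    """Return the set of unweighted edges whose weight exceeds tau."""
    return {pair for pair, w in weights.items() if w > tau}


def thresholded_degrees(nodes, weights, tau):
    """Number of distinct neighbours of each node after thresholding."""
    degree = {u: 0 for u in nodes}
    for u_i, u_j in threshold_projection(weights, tau):
        degree[u_i] += 1
        degree[u_j] += 1
    return degree


# Hypothetical weighted projection on three basic units.
toy_weights = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2}
kept = threshold_projection(toy_weights, tau=1)
deg = thresholded_degrees(["a", "b", "c"], toy_weights, tau=1)
```

With τ = 0 every existing edge survives, so the thresholded degree then counts the distinct neighbours of a node rather than its weighted degree.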

Let us start by considering two nodes u and u' in U
with degrees k_u and k_u', respectively. We now try to derive
an expression for the probability p(k_u, k_u', m) that
there are exactly m nodes in V that are linked simultaneously
to both u and u'. In other words, p(k_u, k_u', m)
is the probability that the number of edges running between
u and u' is m, given that the degrees of the nodes
are k_u and k_u'. Let us assume that the μ nodes that
each node v ∈ V is connected to are all distinct. By the
definition of the growth model for a-BiNs, the event of
u being connected to a node v is independent of u' being
connected to the same node. Therefore, the probability
that a randomly chosen node v ∈ V is connected to u
is k_u/t and the probability that it is connected to u' is
k_u'/t. Recall that t refers to the number of nodes in V.
Thus, the probability that v is connected to both u and
u' is k_u k_u'/t^2. Therefore, the probability that u and u'
share m nodes in V takes the form:



$$p(k_u, k_{u'}, m) = \binom{t}{m} \left(\frac{k_u k_{u'}}{t^2}\right)^{m} \left(1 - \frac{k_u k_{u'}}{t^2}\right)^{t - m} \qquad (14)$$

From Eq. (14), the probability for u and u' of sharing an
edge in thresholded G_U is easily computed as:

$$p(k_u, k_{u'}; m > \tau) = \sum_{m=\tau+1}^{t} p(k_u, k_{u'}, m) \qquad (15)$$

Consequently, in the thresholded G_U, the expected degree
D of a node u whose degree is k in the a-BiN is
given by:

$$D(k,\tau) = N \sum_{i} p_{i,t}\, p(k, i; m > \tau) \qquad (16)$$

Notice that then p_{k,t} can be interpreted as the probability
of finding a randomly chosen node with degree D(k,τ) in
the thresholded one-mode projection. Thus, the degree
distribution of the thresholded G_U is computed as:

$$p_u(q,t;\tau) = \sum_{k:\; q = \lfloor D(k,\tau) \rfloor} p_{k,t} \qquad (17)$$

where the function ⌊a⌋ returns the largest integer not
greater than a.

Fig. 6 shows a comparison between Eq. (17) (solid
curves) and stochastic simulations (symbols) for the one-mode
projection at different times. Eq. (17) was evaluated
by summing over the p_{k,t} obtained
from the stochastic simulations of the corresponding a-BiN,
grouping the terms according to q = ⌊D(k,τ)⌋.

FIG. 6: Comparison between stochastic simulations for the
one-mode projection at different times (symbols) and Eq. (17)
(solid curves). In (a) τ = 0, N = 1000, μ = 5, γ = 1. The
circles and the red curve correspond to t = 20, while the
squares and the black curve to t = 100. In (b) τ = 10,
N = 100, μ = 20, γ = 1.5. The circles and the red curve
correspond to t = 50, while the squares and the black curve
to t = 100.

4. One-mode kernel


Until now we have been describing growth models for
the a-BiNs. The unipartite network G_U is obtained by
projecting the a-BiN onto the set of nodes U. We shall
now attempt to derive a kernel for the growth of the
network G_U, whereby we can construct G_U directly
without constructing the underlying a-BiN. Consider a node
v_t ∈ V that has been introduced in the a-BiN in the
t-th step. There are μ nodes in U to which v_t gets
connected. Let us assume that v_t is connected to no node
in U more than once. This holds in the "parallel
attachment without replacement" model that will be
described in greater detail in Sec. III B. However, as
discussed earlier, if μ ≪ N and γ is small, it is quite
reasonable to make this assumption even in the case of
the "parallel attachment with replacement" model.

Introducing v_t in the a-BiN is equivalent to introducing
a clique (complete graph) of size μ in G_U. This is
because all the nodes that are connected to v_t in the
a-BiN are connected to each other in G_U by virtue of
sharing a common neighbor v_t. Note that this does not
prohibit these μ nodes from having previous connections.
The growth process is such that multiple edges, or
equivalently edge weights larger than 1 between two nodes,
can occur.
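The clique picture can be illustrated with a minimal Python sketch that projects a small bipartite network onto U and then applies a weight threshold. The function names, the representation of the a-BiN as a list of neighbor sets (one per V node), and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def one_mode_projection(neighbors_of_v):
    """Return edge weights of the projection onto U.

    neighbors_of_v: list of sets; entry t holds the U nodes linked to v_t.
    Each V node contributes one clique on its neighbors, so the weight of
    (a, b) counts how many V nodes a and b share.
    """
    weights = Counter()
    for nbrs in neighbors_of_v:
        for a, b in combinations(sorted(nbrs), 2):
            weights[(a, b)] += 1
    return weights


def thresholded_degrees(weights, tau):
    """Degree of each U node when only edges with weight > tau are kept."""
    degree = Counter()
    for (a, b), w in weights.items():
        if w > tau:
            degree[a] += 1
            degree[b] += 1
    return degree


# Toy a-BiN with three V nodes attached to U nodes 0, 1, 2.
toy = [{0, 1, 2}, {0, 1}, {1, 2}]
w = one_mode_projection(toy)
deg_tau0 = thresholded_degrees(w, 0)
deg_tau1 = thresholded_degrees(w, 1)
```

With tau = 0 every shared V node yields an edge, while larger tau keeps only pairs of U nodes that co-occur in more than tau V nodes.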

Let us denote the degree of a node u_i in (the
non-thresholded) G_U after t steps as q_{i,t}. As discussed in
the previous subsection, q_{i,t} = (μ − 1) k_{i,t}, where k_{i,t}
is the degree of u_i in the corresponding a-BiN after t steps.
Noticing that in the a-BiN the μ nodes are chosen
independently of each other solely based on the attachment
kernel, we can define a kernel for selecting a set of μ
nodes in G_U as follows:

$$A(q_{a,t}, q_{b,t}, \ldots) = \prod_{j=a,b,\ldots} A\!\left(\frac{q_{j,t}}{\mu - 1}\right) \qquad (18)$$

where a, b, ... denotes a randomly chosen set of μ nodes
in G_U. Substituting the expression provided in Eq. (1)
for the preferential attachment based kernel we obtain:

$$A(q_{a,t}, q_{b,t}, \ldots) = \prod_{j=a,b,\ldots} \frac{\frac{\gamma}{\mu - 1}\, q_{j,t} + 1}{\sum_{i=1}^{N} \left( \frac{\gamma}{\mu - 1}\, q_{i,t} + 1 \right)} \qquad (19)$$

Below we summarize the growth model for the one-mode
projection of the a-BiN:

• Select a set of μ nodes a, b, ... with the probability
A(q_{a,t−1}, q_{b,t−1}, ...) as described by Eq. (19).

• Introduce edges between every pair of nodes in the
chosen set a, b, ....

• Advance time by a unit and repeat the process.

We assume an initial condition q_i = 0 for all i.
Alternatively, but equivalently, the above growth model
can be described as choosing μ nodes independently, each
with probability A(q_{i,t}/(μ − 1)), and then adding edges
between them.
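The growth steps listed above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name grow_one_mode is hypothetical, the per-node attachment weight is taken to be proportional to γ q/(μ − 1) + 1, and the μ nodes are drawn one after another without replacement so that no node is picked twice in a step.

```python
import random


def grow_one_mode(N, mu, gamma, steps, seed=0):
    """Grow the weighted one-mode network directly (requires 2 <= mu <= N).

    Returns (q, edge_weight): q[i] is the weighted degree of node i and
    edge_weight maps a pair (i, j) with i < j to its multiplicity.
    """
    rng = random.Random(seed)
    q = [0] * N                      # initial condition q_i = 0
    edge_weight = {}
    for _ in range(steps):
        # Assumed kernel weight, up to normalization: gamma*q/(mu-1) + 1,
        # evaluated on the degrees before this step.
        kernel = [gamma * qi / (mu - 1) + 1.0 for qi in q]
        candidates = list(range(N))
        chosen = []
        for _ in range(mu):          # draw mu distinct nodes
            pick = rng.choices(
                candidates, weights=[kernel[c] for c in candidates]
            )[0]
            candidates.remove(pick)
            chosen.append(pick)
        chosen.sort()
        for x in range(mu):          # add a clique on the chosen nodes
            for y in range(x + 1, mu):
                pair = (chosen[x], chosen[y])
                edge_weight[pair] = edge_weight.get(pair, 0) + 1
        for c in chosen:             # each chosen node gains mu - 1 links
            q[c] += mu - 1
    return q, edge_weight


q, edge_weight = grow_one_mode(N=20, mu=4, gamma=1.0, steps=50, seed=1)
```

Every step adds μ(μ − 1)/2 units of edge weight, so the sum of all q values after t steps equals t μ (μ − 1), which is a convenient sanity check for any implementation.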

Fig. 7 plots the degree distribution obtained from the
one-mode kernel and the degree distribution of the
one-mode projection of the u_i nodes of the a-BiN built with
the same parameters. We can see that the one-mode kernel
gives a degree distribution quite similar to that of the
one-mode projection of the bipartite network. The primary
observation from this analysis is that the kernel of the
unipartite growth model has the same form as that of the
bipartite growth model, with a scaling of the parameter
γ by a factor of 1/(μ − 1) in the former. This implies that
as μ increases, the extent of degree-based preference
decreases in the one-mode projection. The analysis,
nevertheless, is valid only for the "without replacement"
model and holds approximately for the "with replacement"
model for μ ≪ N.

FIG. 7: Comparison between the degree distribution obtained
from the stochastic simulation of Eq. (19), averaged over
1000 runs, and the one-mode projection of the a-BiN obtained
using Eq. (3), averaged over 100000 runs, with N = 50,
μ = 5, γ = 0.5 at t = 100. Circles correspond to the one-mode
kernel degree distribution, i.e., Eq. (19), while the stars are
the one-mode projection of the a-BiN.


III. REAL WORLD a-BIN 

A. CoGNet: the codon-gene network 

As complete genomes of more and more organisms
are sequenced, phylogenetic trees reconstructed from
genomic data become increasingly detailed. Codon usage
patterns in different genomes can provide insight into
phylogenetic relations. However, except for some earlier
work [23], studies on codon usage have not received
much attention. One of the main research issues in this
context is to understand the influence of randomness on
the growth pattern of genome sequences during
biological evolution. A well-known random process in
evolutionary biology is random mutation in a gene
sequence. A gene sequence is a string defined over four
symbols (A, G, T, and C) that represent the nucleotides.
A codon is a triplet of adjacent nucleotides (e.g., AGT,
CTA) and codes for a specific amino acid. There are
only 64 codons. Codon usage in genome sequences varies
between different phylogenetic groups.






1. Definition and construction 

We refer to the network of codons and genes as CoGNet
and represent it as an a-BiN where V is the set of genes,
i.e., the genome of the organism, and U is the set of nodes
labeled by the codons. There is an edge (u, v) ∈ E that
runs between V and U if and only if the codon u occurs
in the gene v. Fig. 1 illustrates the structure of CoGNet.

We have analyzed 8 organisms belonging to widely
different phylogenetic groups. These organisms have been
extensively studied in biology and genetics [24] and,
importantly for our purpose, their genomes have been fully
sequenced. In Table I we list these organisms along with
a short description and the number of genes (i.e., the
cardinality of set V) and codons (i.e., the cardinality of set
U). The data have been obtained from the Codon Usage
Database [25, 26]. The usage of a particular codon in an
organism's genome sequence can be as high as one
million. In other words, the degree of the nodes in U can be
arbitrarily large. This, together with the fact that there
are only 64 nodes in U, presents us with the non-trivial
task of estimating the probability distribution p_k, which
has a very large event space (between 0 and a few million),
from very few observations (only 64).

A possible strategy to cope with this situation is
binning of the event space. For example, if we
use a bin size of 10^4, then degree 1 to degree 10^4 is
compressed to a single bin which we label as 1, the next 10^4
degrees are mapped into bin 2, and so on. Thus,
if for a particular organism the codon count is m, then
theoretically, the maximum degree of a codon node can
be m, which in turn implies that with a bin size of 10^4,
there will be m/10^4 bins (or possible events) in which
the 64 data points will be distributed. If all organisms
are analyzed using the same bin size, one obtains a
different number of bins depending on the length of the
organism's genome, i.e., the codon count m. Alternatively,
the bin size can be set for each organism in such a way
that the resulting number of bins remains the same for
all organisms. Thus, if we wish to have b bins for all
organisms, the bin size for a particular organism will be
m/b. Here we analyze the data using both methods:
fixed bin size and fixed number of bins.
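The two binning strategies can be sketched in a few lines of Python. The function names and the toy degrees below are hypothetical, and the sketch assumes integer degrees k with 1 ≤ k ≤ m and bins labeled from 1 as in the text.

```python
from collections import Counter


def bin_fixed_size(k, bin_size):
    """Bin label (starting at 1) of degree k >= 1 for a given bin size."""
    return (k + bin_size - 1) // bin_size      # integer ceil(k / bin_size)


def bin_fixed_count(k, m, b):
    """Bin label in 1..b of degree k when degrees 1..m are split into b bins.

    This corresponds to a bin size of m / b for an organism with codon
    count m.
    """
    return (k * b + m - 1) // m                # integer ceil(k * b / m)


def binned_distribution(labels):
    """Fraction of nodes falling into each occupied bin."""
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in sorted(counts)}


# Toy example: four codon nodes with hypothetical degrees, m = 1000.
degrees = [1, 50, 51, 1000]
by_size = [bin_fixed_size(k, 10) for k in degrees]
by_count = [bin_fixed_count(k, 1000, 20) for k in degrees]
dist = binned_distribution(by_count)
```

Integer arithmetic is used for the ceiling so that degrees lying exactly on a bin boundary are not shifted by floating-point rounding.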

Apart from binning, another way to cope with the
problem of data sparseness is to compute the cumulative
degree distribution P_{k,t} rather than the standard degree
distribution p_{k,t}. P_{k,t} is defined as the probability that a
randomly chosen node has a degree less than or equal to
k. Thus,

$$P_{k,t} = \sum_{i=0}^{k} p_{i,t} \qquad (20)$$

The cumulative distribution is more robust to noise
present in the observed data points, but at the same time
it contains all the information present in p_{k,t} [27]. Note
that even though it is standard practice in statistics
to define the cumulative distribution as stated in Eq. (20),
in the complex network literature it is defined as the
probability that a randomly chosen node has degree
"greater than or equal to" k.

FIG. 8: Degree distribution of the codon nodes for Xenopus
laevis. In (a) a comparison between the real data (symbols)
and the theoretical p_{k,t} obtained using Eq. (5) (black solid
curve) is shown. The cumulative distribution of the real data
(symbols) and the theory (black solid curve) is shown in (b).

Fig. 8(a) shows a comparison between the empirical
degree distribution for Xenopus laevis (symbols) and
the corresponding theoretical distribution predicted by
Eq. (5) at a γ for which the squared error between the
two distributions is minimum. Fig. 8(b) presents the same
data, but in terms of the cumulative distribution.
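For readers who want to reproduce the cumulative view, the following Python sketch computes a degree distribution from a list of node degrees together with both cumulative conventions mentioned in the text ("less than or equal to k" and "greater than or equal to k"). All function names and the toy degree list are illustrative assumptions.

```python
from collections import Counter


def degree_distribution(degrees):
    """p[k] = fraction of nodes with degree k, for k = 0..max(degrees)."""
    counts = Counter(degrees)
    n = len(degrees)
    return [counts.get(k, 0) / n for k in range(max(degrees) + 1)]


def cumulative(p):
    """P[k] = sum of p[i] for i <= k (the statistics convention)."""
    out, running = [], 0.0
    for value in p:
        running += value
        out.append(running)
    return out


def complementary_cumulative(p):
    """P[k] = sum of p[i] for i >= k (the network-literature convention)."""
    out, running = [], 0.0
    for value in reversed(p):
        running += value
        out.append(running)
    return out[::-1]


p = degree_distribution([0, 1, 1, 3])
P_le = cumulative(p)
P_ge = complementary_cumulative(p)
```

Both conventions carry the same information, since P_ge[k] = 1 − P_le[k − 1] for k ≥ 1.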

2. Growth model 

A particular gene does not acquire all its constituent
codons at a single time instance but evolves from an
ancestral gene through the process of mutation (addition,
deletion or substitution of codons) [28]. Therefore,
we choose to apply the "sequential attachment" based
growth model for the synthesis of CoGNet. This means that
we model the CoGNet growth through Eqs. (3) and (5).

For all the CoGNets, the value of N is 64, μ is 1, and t
corresponds to the number of codons that appear in the
genome of the organism. In our model, we have a single
fitting parameter, γ. The value of γ is chosen such that
the difference or error between the distributions obtained
from the empirical data and the synthesized CoGNet is
minimized. The error, E, is defined as follows:

$$E = \sum_{k=0}^{\infty} \left| p^{*}(k) - p(k) \right| \qquad (21)$$

where p^* represents the empirical distribution.
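The fitting procedure can be sketched as a simple grid search in Python. This is a hedged illustration, not the procedure actually used for the reported values: the error is assumed here to be a sum of absolute differences between the two distributions, the names error and fit_gamma are hypothetical, and model stands for any user-supplied function that returns the theoretical distribution for a given γ.

```python
def error(p_emp, p_model):
    """Sum over k of |p_emp(k) - p_model(k)|; the shorter list is zero-padded."""
    n = max(len(p_emp), len(p_model))
    a = list(p_emp) + [0.0] * (n - len(p_emp))
    b = list(p_model) + [0.0] * (n - len(p_model))
    return sum(abs(x - y) for x, y in zip(a, b))


def fit_gamma(p_emp, model, gammas):
    """Return the candidate gamma whose model distribution has minimal error.

    model: callable gamma -> list of probabilities (assumed to be supplied
    by the user, e.g. from theory or from averaged simulations).
    """
    return min(gammas, key=lambda g: error(p_emp, model(g)))


# Toy usage with a made-up two-point "model" p(0) = 1 - g, p(1) = g.
best = fit_gamma([0.7, 0.3], lambda g: [1 - g, g], [0.1, 0.3, 0.5])
```

A finer grid around the best candidate, or a squared-error variant of error, can be substituted without changing the structure of the search.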

Fig. 9 shows the cumulative real data and the corresponding
theoretical distributions of the eight organisms listed
in Table I. Table II lists the values of γ for two different
methods of binning: fixed bin count (bin count = 20) and
fixed bin size (bin size = 10^4).

It can be observed that the values of γ get polarized
into two distinct groups. The value of γ for binning with
fixed bin size is much higher for three organisms (between
1.36 and 2.38), that are simple and primitive, than for the
rest (between 0.11 and 0.35), which are more complex and
came into existence at a later stage of evolution. In
order to test whether the bin size might influence the value
of γ, the experiments were repeated with various bin sizes.
The analysis reveals that the polarization of the organisms
into two classes based on the value of γ is almost
independent of the bin size.

TABLE I: List of organisms along with their probable origin time (in million years ago, MYA) and codon and gene
counts.

Organism's name            Description                          Origin time (MYA)   Gene count   Codon count
Myxococcus xanthus         Gram-negative rod-shaped bacterium   3200                7491         989974
Dictyostelium discoideum   Soil-living amoeba                   2100                3369         1962284
Plasmodium falciparum      Protozoan parasite                   542                 4098         3032432
Saccharomyces cerevisiae   Single-celled fungus                 488                 14374        6511964
Xenopus laevis             Amphibian, African clawed frog       416                 12199        5313335
Drosophila melanogaster    Two-winged insect, fruit fly         270                 40721        21393288
Danio rerio                Tropical fish, zebrafish             145                 19062        8042248
Homo sapiens               Bipedal primate, human               2                   89533        38691091

TABLE II: The values of γ that yield the best fit for the degree distribution under the two different binning strategies.

Organism's name            Best γ (fixed bin size)   Best γ (fixed bin count)
Myxococcus xanthus         2.35                      2.1
Dictyostelium discoideum   2.38                      2.57
Plasmodium falciparum      1.36                      1.81
Saccharomyces cerevisiae   0.35                      0.34
Xenopus laevis             0.11                      0.11
Drosophila melanogaster    0.28                      0.2
Danio rerio                0.14                      0.1
Homo sapiens               0.20                      0.09



We conclude that, at least at the level of codon usage,
in Myxococcus xanthus, Dictyostelium discoideum, and
Plasmodium falciparum the degree of randomness during
codon selection is much lower than in Saccharomyces
cerevisiae, Xenopus laevis, Drosophila melanogaster,
Danio rerio, and Homo sapiens. These findings are probably
correlated with the origin time and the evolutionary
processes that shaped the usage of codons, as follows. Let
us think of evolution as the product of "copy-paste"
operations. In this way, new genes emerge as a result of
defective copy-paste operations where the ancestral genes
that are being copied are altered by addition, deletion
or substitution of codons. Thus, copy-paste operations
without defects lead to a high degree of "preferential
attachment", while mutations/defects increase the degree
of randomness. In consequence, we expect newly born
species/organisms to exhibit a higher degree of randomness
than their ancestors, given the greater number of
mutations experienced by the newly formed organisms.
The value of γ in Table II reflects this fact, and suggests
that knowledge at the level of codon usage (i.e., γ) can
be used as a criterion to classify organisms.



B. PlaNet: the phoneme-language network 

In this section, we attempt to explain the
self-organization of the consonant inventories through an
a-BiN where the consonants make up the basic units and
languages are thought of as discrete combinations of them.
In fact, the most basic units of human languages are the
speech sounds. The repertoire of sounds that make up
the sound inventory of a language is not chosen
arbitrarily. Indeed, the inventories show exceptionally
regular patterns across the languages of the world, which is
arguably an outcome of the self-organization that goes on
in shaping their structures [29]. In order to explain this
self-organizing behavior of the sound inventories, various
functional principles have been proposed, such as ease of
articulation [30, 31], maximal perceptual contrast [30] and
learnability [31]. The structure of vowel inventories has
been successfully explained through the principle of
maximal perceptual contrast [30, 31]. Although there has
been some linguistically motivated work investigating the
structure of the consonant inventories, most of it is
limited to certain specific properties rather than
providing a holistic explanation of the underlying principle
of its organization.



1. Definition and construction 

A first study of the consonant-language network as an
a-BiN can be found in [32]. Here we follow the same
definitions given in [32] and refer to the consonant-language
a-BiN as PlaNet or Phoneme-Language Network. U is
the universal set of consonants and V is the set of
languages of the world. There is an edge (u, v) ∈ E iff the
consonant u occurs in the sound inventory of the language
v. On the other hand, the one-mode projection of PlaNet
onto the consonant nodes is called PhoNet. Fig. 10
illustrates the structures of PlaNet and PhoNet. Note that
PlaNet is an unweighted bipartite graph, whereas PhoNet
has been represented as a weighted graph.

FIG. 9: Cumulative degree distributions for the empirical data (symbols) and their corresponding theoretical best γ-fits through
Eqs. (3) and (5) (solid curve) for the organisms (a) Myxococcus xanthus, (b) Dictyostelium discoideum, (c) Plasmodium
falciparum, (d) Saccharomyces cerevisiae, (e) Xenopus laevis, (f) Drosophila melanogaster, (g) Danio rerio, and (h) Homo
sapiens.

FIG. 10: Illustration of the nodes and edges of PlaNet and
PhoNet.

Many typological studies [30, 33, 34] of segmental
inventories have been carried out in the past on the UCLA
Phonological Segment Inventory Database (UPSID) [35].
UPSID records the sound inventories of 317 languages
covering all the major language families of the world. In
this work, we have used UPSID, consisting of these 317
languages and the 541 consonants found across them, for
constructing PlaNet. Consequently, there are 317 elements
(nodes) in the set V and 541 elements (nodes) in the
set U. The numbers of elements (edges) in the set E as
computed from PlaNet and PhoNet are 7022 and 30412,
respectively. We selected UPSID mainly for two reasons:
(a) it is the largest database of this type that is
currently available, and (b) it has been constructed by
selecting one language each from moderately distant
language families, which ensures a considerable degree of
"genetic" balance.



2. Topological properties 

Fig. 11 illustrates the (cumulative) degree distribution
of U. Since the degree of a language node is nothing
but the size of its consonant inventory, we take as μ,
i.e., the degree of each V node, the average number of
consonants in human languages, which is 22. Recall that
in the theory for a-BiNs the degree of each node in V has
been assumed to be a constant μ.



3. Growth models 

In order to obtain a theoretical description of the
degree distribution of the consonant nodes in PlaNet (and
later on PhoNet), we employ the a-BiN growth model
described in Sec. II B. We assume that all the language
nodes have a degree μ = 22. Clearly, N = 541 is the total
number of consonant nodes and t = 317 is the total
number of languages. Thus, γ is the only free parameter in
the model. Notice that, by definition, in PlaNet a
consonant can occur only once in a language inventory.
Therefore, unlike the case of CoGNet, PlaNet is an a-BiN that
has been constructed using a "parallel attachment
without replacement" scheme. However, we expect the theory
developed in Sec. II B, corresponding to "parallel
attachment with replacement", to be a fairly good
approximation for the degree distribution of PlaNet. We shall
refer to this theoretical model of PlaNet as PlaNet_theo. The
best fit for the free parameter γ was obtained
with γ = 14 (see Fig. 11). Since 1 < γ < N/μ = 24.6,
based on our theoretical analysis we can conclude that
the attachments are largely preferential in nature and
the degrees follow a beta distribution with a single mode
at k = 1.

FIG. 11: Cumulative degree distribution of U, i.e., the
consonant nodes. Squares correspond to the empirical data, and
circles to simulations performed with "parallel attachment
without replacement" with γ = 14 (PlaNet_sim). The solid line
corresponds to the theoretical solution for "parallel
attachment with replacement" (PlaNet_theo) obtained through
integration of Eq. (10) with γ = 14.

FIG. 12: Cumulative degree distribution of the one-mode
projection of PlaNet (PhoNet). Squares correspond to the
empirical data (Real PhoNet), dash-dotted line to simulations
of the one-mode projection model with "attachment without
replacement" using kernel Eq. (1) (PhoNet_sim). The solid curve
shows the theoretical degree distribution with the
"attachment with replacement" scheme using Eq. (11) (PhoNet_theo).
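The numerical values quoted for PlaNet can be cross-checked with a few lines of Python. The variable names are illustrative; the inputs (317 languages, 541 consonants, 7022 bipartite edges, best-fit γ = 14) are the figures reported in the text.

```python
languages = 317        # |V|, number of language nodes (t)
consonants = 541       # |U|, number of consonant nodes (N)
planet_edges = 7022    # |E| of PlaNet

# Average consonant inventory size, i.e. mean degree of a language node.
mean_inventory = planet_edges / languages
mu = round(mean_inventory)

# Upper bound N / mu of the regime discussed in the text.
upper_bound = consonants / mu

best_gamma = 14
in_regime = 1 < best_gamma < upper_bound
```

The mean inventory size evaluates to about 22.15, consistent with the choice μ = 22, and N/μ is about 24.59, so the fitted γ = 14 lies inside the interval 1 < γ < N/μ.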

To study the effect of the "parallel attachment
without replacement" scheme, we carry out stochastic
simulations with the growth model described below.
Suppose that a language node v_i (with degree 22) is added
to the system and that j < 22 edges of the incoming
node have already been attached to the distinct consonant
nodes u_1, u_2, ..., u_j. Then, the (j + 1)th edge is attached
to a consonant node based on the same preferential
attachment kernel (see Eq. (1)), but applied on the
reduced set U − {u_1, u_2, ..., u_j}, i.e., the previously selected
consonant nodes u_1, u_2, ..., u_j cannot participate in the
selection process of the (j + 1)th edge of v_i. This ensures
that a consonant node is never chosen twice. We shall
refer to the degree distributions of the consonant nodes
obtained in this way as PlaNet_sim. The degree
distribution of PlaNet_sim has the best match with the degree
distribution of the real PlaNet when γ = 14.
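The simulation procedure just described can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the code used for the reported results: the function name is hypothetical, the attachment weight of a consonant with degree k is assumed to be proportional to γ k + 1, and degrees are updated only after all μ edges of a language have been placed.

```python
import random


def simulate_without_replacement(N, mu, gamma, t, seed=0):
    """Return consonant degrees after adding t language nodes of degree mu.

    Each language picks mu distinct consonants; every pick uses the assumed
    kernel weight gamma * k + 1 restricted to not-yet-chosen consonants.
    """
    rng = random.Random(seed)
    k = [0] * N
    for _ in range(t):
        available = list(range(N))
        chosen = []
        for _ in range(mu):
            weights = [gamma * k[u] + 1.0 for u in available]
            pick = rng.choices(available, weights=weights)[0]
            available.remove(pick)   # a consonant is never chosen twice
            chosen.append(pick)
        for u in chosen:
            k[u] += 1
    return k


# Small toy run; the PlaNet setting would be N=541, mu=22, t=317.
degrees = simulate_without_replacement(N=50, mu=5, gamma=14, t=40, seed=3)
```

Because every language contributes exactly one edge to each chosen consonant, no degree can exceed t and the degrees always sum to t μ.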

We have calculated the error for the aforementioned
stochastic simulation model (E_sim) as well as for the theory
of Sec. II B corresponding to "parallel attachment with
replacement" (E_theo). The error has been computed
using Eq. (21), where p^* stands for the degree distribution
of the real PlaNet. It is found that E_sim = 0.0972 and
E_theo = 0.1170. The smaller error of the simulation
indicates that the "parallel attachment without replacement"
scheme describes the structure of consonant inventories
better than "parallel attachment with replacement".

4. One-mode projection: PhoNet

Interestingly, when we reconstruct the one-mode
projection from either the theory using the "attachment with
replacement" scheme (PhoNet_theo) or the stochastic
simulation considering the "attachment without replacement"
model (PhoNet_sim), we cannot match the empirical data.
Fig. 12 shows the cumulative degree distributions of
PhoNet_sim, PhoNet_theo and the real PhoNet. We have
calculated the errors of PhoNet_sim and PhoNet_theo with
respect to the real PhoNet using Eq. (21) and refer
to them as E_sim and E_theo, respectively. We find
E_sim = 0.1230 and E_theo = 0.1438. The results show a
larger quantitative difference between the curves than
between their bipartite counterparts. This indicates that
the one-mode projection has a more complex structure than
could have emerged from a simple preferential attachment
based kernel.

Nevertheless, we observe that preferential attachment can
explain the occurrence distribution of the consonants
over languages to a good extent. One possible way to
explain this observation would be that a consonant which
is prevalent among the speakers of a given linguistic
generation tends to be more prevalent in the subsequent
generations, with very little randomness involved in this
whole process. It is this micro-level dynamics that
manifests itself as preferential attachment in PlaNet.
However, the fact that the co-occurrence distribution of the
consonants, i.e., the degree distribution of PhoNet, is not
explained by the growth model implies that there are
other organizing principles, absent in our current model,
that are involved in shaping the structure of the
consonant inventories.



IV. DISCUSSION AND CONCLUSION 

In the preceding sections, we have presented growth 
models for discrete combinatorial systems in the frame- 
work of a special class of networks - a-BiNs. To summa- 
rize some of our important contributions, we have 

• proposed growth models for a-BiNs, which are 
based on preferential attachment coupled with a 
tunable randomness component, 

• extended the mathematical analysis presented
in [18] and derived the exact expression for the
degree distribution in the case of parallel attachment,

• analytically derived the degree distribution of the 
one-mode projection, 

• and presented case studies for two well-known 
DCSs from the domain of biology and language 
and, thereby, we have validated our analytical find- 
ings against the empirical data. 

It is worthwhile to mention here that there are
alternative perspectives on the DCS problem presented
here. One of the most celebrated among
these is the "Pólya's urn" model (see [36]). In this
classical model there is an urn initially containing r red and b
blue balls. One ball is chosen randomly from the urn.
The ball is then put back into the urn together with
another new ball (presumably from a collection stored
elsewhere) of the same color. Hence, the total number of
balls in the urn grows. Generalizations of this classical
model have been proposed and solved by Chung et al.
in [37]. In this model, the authors assume that there are
finitely many urns, each containing one ball, and that
additional balls arrive one at a time. With each new incoming
ball, a new urn is created with a probability p and the
ball is placed in this newly created urn. With probability
1 − p the ball is placed in an existing urn, where the
probability that an urn currently containing m balls is chosen
for placing the new ball is proportional to m^ν. Note that,
for p = 0, the number of urns is fixed and finite and the
model resembles the one we proposed here; however, in
this case the tunable randomness component γ, which is
the most important parameter of our model, is absent.
From the analysis of this model the authors find that for
ν < 1, the balls in all the urns grow at roughly the same
rate. For ν > 1, one urn dominates, i.e., the probability
that any new ball goes into that urn is equal to 1. For
ν = 1, the fraction of balls going into each urn converges,
though the limit is uniformly distributed in a certain
simplex (see [37] for proofs). Here we have derived the
exact analytical form for the probability distribution of the
number of urns with a specified number of balls (k)
after the addition of t balls. Moreover, the proposed model
takes into account a tunable randomness parameter, as
well as the case where more than one ball is placed into
the urns simultaneously (parallel attachment).
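The generalized urn process summarized above can be illustrated with a short Python simulation. This is a sketch of the process as described in this section, not a reproduction of the analysis in the cited work; the function name and the parameter name exponent (the power applied to the ball count m) are assumptions introduced for illustration.

```python
import random


def simulate_urns(n_urns, n_balls, p, exponent, seed=0):
    """Return the ball counts per urn after n_balls arrivals.

    Start with n_urns urns holding one ball each. Each arriving ball opens
    a new urn with probability p; otherwise it joins an existing urn chosen
    with probability proportional to m ** exponent, m being its ball count.
    """
    rng = random.Random(seed)
    urns = [1] * n_urns
    for _ in range(n_balls):
        if rng.random() < p:
            urns.append(1)
        else:
            weights = [m ** exponent for m in urns]
            idx = rng.choices(range(len(urns)), weights=weights)[0]
            urns[idx] += 1
    return urns


fixed = simulate_urns(n_urns=10, n_balls=200, p=0.0, exponent=1.0, seed=2)
growing = simulate_urns(n_urns=10, n_balls=200, p=1.0, exponent=1.0, seed=2)
```

With p = 0 the number of urns stays fixed, which is the case compared with the a-BiN model in the text; varying exponent below, at, or above 1 lets one observe the three regimes qualitatively.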

Another important issue that needs mention is that,
although the reported results are strictly valid for a set
of basic units fixed in time, we argue that this condition
can be relaxed. We can find some real systems where the
set of basic units also grows, albeit at a far slower rate
than the set of their discrete combinations. Under this
condition we can expect the reported results to hold
approximately as long as the growth rate of the basic units
is slow enough.

Finally, as this study reveals, there are certain
limitations of the growth models proposed here. For instance,
it has been shown through simulations that the degree
distribution of the consonant nodes in PlaNet is better
explained by a superlinear kernel than by the
linear kernel introduced here [38]. An analytical
treatment of such a superlinear kernel should be an interesting
topic for future research. There are also some limitations
in the study of CoGNet. Selecting the correct binning
policy to construct the CoGNet is a challenging task.
Modeling the CoGNet with parallel attachment, where μ is the
average number of codons present in the genes, is a direct
extension of the current work. As a first step, we have here
classified the eight organisms into two sets, and we
believe that our new method can further contribute to the
reconstruction of phylogenetic relations. Our approach
may be especially useful for the analysis of genome
sequences which are so far only available in fragments,
either due to fragmentary sampling of the biological
material or due to unfinished sequencing efforts.



Acknowledgments 

This work was partially financed by the Indo-German
collaboration project DST-BMBT through grant
"Developing robust and efficient services for open source
Internet telephony over peer to peer network". N.G., A.N.M.
and A.M. acknowledge the hospitality of TU-Dresden.
F.P. acknowledges the hospitality of IIT-Kharagpur and
funding through grant ANR BioSys (Morphoscale).






[1] S. Pinker, The Language Instinct: How the Mind Creates Language (Perennial, 1995).
[2] J.J. Ramasco, S.N. Dorogovtsev, and R. Pastor-Satorras, Phys. Rev. E 70, 036106 (2004).
[3] D.J. Watts and S.H. Strogatz, Nature 393, 440 (1998).
[4] R. Albert and A.-L. Barabási, Phys. Rev. Lett. 85, 5234 (2000).
[5] M. Peltomäki and M. Alava, J. Stat. Mech. 1, 01010 (2006).
[6] L.A.N. Amaral et al., Proc. Natl. Acad. Sci. 97, 11149 (2000).
[7] M.E.J. Newman, Phys. Rev. E 64, 016132 (2001).
[8] A.-L. Barabási et al., Physica A 311, 590 (2002).
[9] R. Lambiotte and M. Ausloos, Phys. Rev. E 72, 066117 (2005).
[10] G. Caldarelli and M. Catanzaro, Physica A 338, 98 (2004).
[11] S.H. Strogatz, Nature 410, 268 (2001).
[12] S. Eubank et al., Nature 429, 180 (2004).
[13] R. Ferrer i Cancho and R.V. Solé, Proc. R. Soc. Lond. B 268, 2261 (2001).
[14] J.-L. Guillaume and M. Latapy, Information Processing Letters 90, 215 (2004).
[15] W. Souma, Y. Fujiwara, and H. Aoyama, Physica A 324, 396 (2003).
[16] K. Sneppen, Europhys. Lett. 67, 349 (2004).
[17] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[18] F. Peruani, M. Choudhury, A. Mukherjee, and N. Ganguly, Europhys. Lett. 79, 28001 (2007).
[19] S.N. Dorogovtsev and J.F.F. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW (Oxford University Press, 2003).
[20] R. Ferrer i Cancho, R.V. Solé, and R. Köhler, Phys. Rev. E 69, 051915 (2004).
[21] M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. E 64, 026118 (2001).
[22] M.E.J. Newman, Proc. Natl. Acad. Sci. 101, 5200 (2004).
[23] P. Sharp et al., Nucl. Acids Res. 16(17), 8207 (1988).
[24] S.B. Hedges, Nature Reviews Genetics 3, 838 (2002).
[25] Y. Nakamura, T. Gojobori, and T. Ikemura, Nucl. Acids Res. 28, 292 (2000).
[26] Codon usage database: http://www.kazusa.or.jp/codon/
[27] M.E.J. Newman, SIAM Review 45, 167 (2003).
[28] T. Kunkel and K. Bebenek, Annual Review of Biochemistry 69, 497 (2000).
[29] P.-Y. Oudeyer, Self-organization in the Evolution of Speech (Oxford University Press, 2006).
[30] B. Lindblom and I. Maddieson, Language, Speech, and Mind, 62 (1988).
[31] B. de Boer, Journal of Phonetics 28, 441 (2000).
[32] M. Choudhury et al., Proceedings of COLING-ACL P06, 128 (2006).
[33] F. Hinskens and J. Weijer, Linguistics 41, 1041 (2003).
[34] P. Ladefoged and I. Maddieson, Sounds of the World's Languages (Oxford, Blackwell, 1996).
[35] I. Maddieson, Patterns of Sounds (Cambridge University Press, 1984).
[36] N. Johnson and S. Kotz, Urn Models and Their Applications: An Approach to Modern Discrete Probability Theory (Wiley, New York, 1977).
[37] F. Chung, S. Handjani, and D. Jungreis, Annals of Combinatorics 7, 141 (2003).
[38] A. Mukherjee et al., Journal of Quantitative Linguistics, http://arxiv.org/abs/physics/0610120 (2008).
[39] W. Dahui, Z. Li, and D. Zengru, Physica A 363, 359 (2006).
[40] Numerical evidence of the non-scale free character of the degree distribution of this type of system was first reported in [39].
[41] The names with and without replacement refer to the fact [...] u_k has been selected by one of the μ edges of node v_i, [...]